Approximate  Nearest  Neighbors:  Towards  Removing  the 

Curse  of  Dimensionality 

(preliminary  version) 

Piotr  Indyk*  Rajeev  MotwanR 

Department  of  Computer  Science 
Stanford  University 
Stanford,  CA  94305 

{indyk , raj  eev}@cs . Stanford . edu 

July  21,  1999 


Abstract 

The  nearest  neighbor  problem  is  the  following:  Given  a  set  of  n  points  P  =  {p i, . . .  ,p„}  in 
some  metric  space  X ,  preprocess  P  so  as  to  efficiently  answer  queries  which  require  finding  the 
point  in  P  closest  to  a  query  point  q  6  X.  We  focus  on  the  particularly  interesting  case  of  the 
d-dimensional  Euclidean  space  where  X  =  under  some  lp  norm.  Despite  decades  of  effort, 
the  current  solutions  are  far  from  satisfactory;  in  fact,  for  large  d ,  in  theory  or  in  practice,  they 
provide  little  improvement  over  the  brute-force  algorithm  which  compares  the  query  point  to 
each  data  point.  Of  late,  there  has  been  some  interest  in  the  approximate  nearest  neighbors 
problem,  which  is:  Find  a  point  p  £  P  that  is  an  e-approximate  nearest  neighbor  of  the  query 
q  in  that  for  all  p'  6  P,  d(p,  q)  <  (1  +  e)d(p' ,  q). 

We  present  two  algorithmic  results  for  the  approximate  version  that  significantly  improve  the 
known  bounds:  (a)  preprocessing  cost  polynomial  in  n  and  d,  and  a  truly  sublinear  query  time; 
and,  (b)  query  time  polynomial  in  log  n  and  d,  and  only  a  mildly  exponential  preprocessing  cost 
O(n)  xO( l/e)d .  Further,  applying  a  classical  geometric  lemma  on  random  projections  (for  which 
we  give  a  simpler  proof),  we  obtain  the  first  known  algorithm  with  polynomial  preprocessing 
and  query  time  polynomial  in  d  and  logn.  Unfortunately,  for  small  e,  the  latter  is  a  purely 
theoretical  result  since  the  exponent  depends  on  1/e.  Experimental  results  indicate  that  our 
first  algorithm  offers  orders  of  magnitude  improvement  on  running  times  over  real  data  sets.  Its 
key  ingredient  is  the  notion  of  locality- sensitive  hashing  which  may  be  of  independent  interest; 
here,  we  give  applications  to  information  retrieval,  pattern  recognition,  dynamic  closest-pairs, 
and  fast  clustering  algorithms. 
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1  Introduction 


The  nearest  neighbor  search  (NNS)  problem  is:  Given  a  set  of  n  points  P  —  {pi, .  . .  ,pn}  in  a 
metric  space  X  with  distance  function  d,  preprocess  P  so  as  to  efficiently  answer  queries  for  finding 
the  point  in  P  closest  to  a  query  point  q  G  X.  We  focus  on  the  particularly  interesting  case  of  the 
d-dimensional  Euclidean  space  where  X  =  W1  under  some  lp  norm.  The  low-dimensional  case  is 
well-solved  [27],  so  the  main  issue  is  that  of  dealing  with  the  “curse  of  dimensionality”  [17].  The 
problem  was  originally  posed  in  the  1960s  by  Minsky  and  Papert  [54,  pp.  222-225],  and  despite 
decades  of  effort  the  current  solutions  are  far  from  satisfactory.  In  fact,  for  large  d,  in  theory  or 
in  practice,  they  provide  little  improvement  over  a  brute-force  algorithm  which  compares  a  query 
q  to  each  p  G  P.  The  known  algorithms  are  of  two  types:  (a)  low  preprocessing  cost  but  query 
time  linear  in  n  and  d:  and,  (b)  query  time  sublinear  in  n  and  polynomial  in  d,  but  with  severely 
exponential  preprocessing  cost  nd.  This  unfortunate  situation  carries  over  to  average-case  analysis, 
and  even  to  the  e-approximate  nearest  neighbors  (e-NNS)  problem:  Find  a  point  p  €  P  that 
is  an  e-approximate  nearest  neighbor  of  the  query  q ,  in  that  for  all  p'  6  P ,  d(p,  q)  <  (1  +  e)d(p/,  q). 

We  present  two  algorithms  for  the  approximate  version  that  significantly  improve  the  known 
bounds:  (a)  preprocessing  cost  polynomial  in  n  and  d,  and  a  truly  sublinear  query  time;  and,  (b) 
query  time  polynomial  in  log??,  and  d,  and  only  a  mildly  exponential  preprocessing  cost  0(n)  X 
0(l/e)d.  Further,  by  applying  a  classical  geometric  lemma  on  random  projections  (for  which  we 
give  a  simpler  proof),  we  obtain  the  first  known  algorithm  with  polynomial  preprocessing  and  query 
time  polynomial  in  d  and  logn.  Unfortunately,  for  small  e,  this  is  a  purely  theoretical  result  as  the 
exponent  depends  on  1/e.  Experimental  results  [37]  indicate  that  the  first  algorithm  offers  orders 
of  magnitude  improvement  on  running  times  over  real  data  sets.  Its  key  ingredient  is  the  notion  of 
locality-sensitive  hashing  which  may  be  of  independent  interest;  we  give  applications  to  information 
retrieval,  pattern  recognition,  dynamic  closest-pairs,  and  fast  clustering. 

Motivation.  The  nearest  neighbors  problem  is  of  major  importance  to  a  variety  of  applications, 
usually  involving  similarity  searching.  Some  examples  are:  data  compression  [36];  databases  and 
data  mining  [13,  39];  information  retrieval  [11,  21,  58];  image  and  video  databases  [29,  31,  56, 
61];  machine  learning  [19];  pattern  recognition  [20,  26];  and,  statistics  and  data  analysis  [22,  45]. 
Typically,  the  features  of  the  objects  of  interest  (documents,  images,  etc)  are  represented  as  points 
in  iR.d  and  a  distance  metric  is  used  to  measure  (dis) similarity  of  objects.  The  basic  problem 
then  is  to  perform  indexing  or  similarity  searching  for  query  objects.  The  number  of  features 
(i.e. ,  the  dimensionality)  ranges  anywhere  from  tens  to  thousands.  For  example,  in  multimedia 
applications  such  as  IBM’s  QBIC  (Query  by  Image  Content) ,  the  number  of  features  could  be  several 
hundreds  [29,  31].  In  information  retrieval  for  text  documents,  vector-space  representations  involve 
several  thousands  of  dimensions,  and  it  is  considered  to  be  a  dramatic  improvement  that  dimension- 
reduction  techniques,  such  as  LSI  (latent  semantic  indexing)  [9,  11,  21],  principal  components 
analysis  [40]  or  the  Karluinen  l.oeve  transform  [44,  50],  can  reduce  the  dimensionality  to  a  mere 
few  hundreds! 

Of  late,  there  has  been  an  increasing  interest  in  avoiding  the  curse  of  dimensionality  by  resorting 
to  approximate  nearest  neighbor  searching.  Since  the  selection  of  features  and  the  use  of  a  distance 
metric  in  the  applications  are  rather  heuristic  and  merely  an  attempt  to  make  mathematically 
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precise  what  is  after  all  an  essentially  aesthetic  notion  of  similarity,  it  seems  like  an  overkill  to 
insist  on  the  absolute  nearest  neighbor;  in  fact,  determining  an  e- approximate  nearest  neighbor  for  a 
reasonable  value  of  e,  say  a  small  constant,  should  suffice  for  most  practical  purposes.  Unfortunately, 
even  this  relaxation  of  goals  has  not  removed  the  curse  of  dimensionality,  although  the  recent  results 
of  Kleinberg  [46]  gives  some  improvements. 

Previous  Work.  Sa.met  [59]  surveys  a  variety  of  data  structures  for  nearest  neighbors  including 
variants  of  &-d  trees,  R- trees,  and  structures  based  on  space-filling  curves;  more  recent  results 
are  surveyed  in  [60].  While  some  perform  well  in  2-3  dimensions,  in  high-dimensional  spaces  they 
all  exhibit  poor  behavior  in  the  worst  case  and  in  typical  cases  as  well  (e.g.,  see  Arya,  Mount, 
and  Narayan  [4]).  Dobkin  and  Lipton  [23]  were  the  first  to  provide  an  algorithm  for  nearest 
neighbors  in  4? A  with  query  time  0(2dlogn)  and  preprocessing1  cost  0(n2d+1).  Clarkson  [16] 
reduced  the  preprocessing  to  0(nr<i/2l(1+4)),  while  increasing  the  query  time  to  O (2° ( l°s d)  log  n). 
Later  results,  e.g.,  Agarwal  and  Matousek  [1],  Matousek  [51],  and  Yao  and  Yao  [65],  all  suffer 
from  a  query  time  that  is  exponential  in  d.  Meiser  [52]  obtained  query  time  0(d5  log  re)  but  after 
0(nd+s)  preprocessing.  The  so-called  “vantage  point”  technique  [12,  13,  62,  63]  is  a  recently 
popular  heuristic,  but  we  are  not  aware  of  any  analysis  for  high-dimensional  Euclidean  spaces.  In 
general,  even  the  average-case  analysis  of  heuristics  for  points  distributed  over  regions  in  gives 
an  exponential  query  time  [7,  35,  59]. 

The  situation  is  only  slightly  better  for  approximate  nearest  neighbors.  Arya  and  Mount  [3]  gave 
an  algorithm  with  query  time  O ( 1/e) dO (log  re)  and  preprocessing  0(l/e)d0(n).  The  dependence 
on  e  was  later  reduced  by  Clarkson  [17]  and  Chan  [15]  to  e-U-1)/2.  Arya,  Mount,  Netanyahu, 
Silverman,  and  Wu  [5]  obtained  optimal  0(n)  preprocessing  cost,  but  with  query  time  growing  as 
0(dd).  Bern  [8]  and  Chan  [15]  considered  error  e  polynomial  in  d  and  managed  to  avoid  exponential 
dependence  in  that  case.  Recently,  Kleinberg  [46]  gave  an  algorithm  with  0(n  log  d)2d  preprocessing 
and  query  time  polynomial  in  d.  e,  and  log  re,  and  another  algorithm  with  preprocessing  polynomial 
in  d,  e,  and  re.  but  with  query  time  0(n  +  dlog3  re).  The  latter  improves  the  O(dn)  time  bound  of 
the  brute-force  algorithm. 

For  the  Hamming  space  {0,  l}d,  Dolev,  Ha.rari,  and  Parnas  [25]  and  Dolev,  Hara.ri,  Linial,  Nisa.n, 
and  Parnas  [24]  gave  algorithms  for  retrieving  all  points  within  distance  r  of  the  query  q.  Unfor¬ 
tunately,  for  arbitrary  r,  these  algorithms  are  exponential  either  in  query  time  or  preprocessing. 
Greene,  Parnas,  and  Yao  [38]  present  a  scheme  which,  for  binary  data,  chosen  uniformly  at  random , 
retrieves  all  points  within  distance  r  of  q  in  time  0(dnr/d),  using  0(dn1+r/d)  preprocessing. 

Very  recently,  Kushilevitz,  Ostrovsky  and  Ra.ba.ni  [47]  obtained  a.  result  similar  to  Proposition  3 
below. 

Overview  of  Results  and  Techniques.  Our  main  results  are  algorithms2  for  e-NNS  described 
below.3 _ 

1  Throughout. ,  preprocessing  cost,  refers  to  the  space  requirement;  typically,  the  preprocessing  time  is  roughly  the 
same. 

2Our  algorithms  are  randomized  and  return  an  approximate  nearest,  neighbor  with  constant  probability.  To  reduce 
the  error  probability  to  a ,  we  can  use  several  data  structures  in  parallel  and  return  the  best,  result,  increasing 
complexity  by  a  factor  O(logcv), 

3  For  the  sake  of  clarity,  the  O  notation  is  used  to  hide  terms  that,  are  poly-logarithmic  in  n. 
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Proposition  1  For  e  >  0,  there  is  an  algorithm  for  e-NNS  in  under  the  lp  norm  for  p  G  [1,  2] 
which  uses  0(n1+1^1+e^  +  dn )  preprocessing  and  requires  0(dnl^l+f^)  query  time. 

Proposition  2  For  0  <  e  <  1,  there  is  an  algorithm  for  e-NNS  in  WJ  under  any  lp  norm  which 
uses  0(n )  X  0(l/e)d  preprocessing  and  requires  0(d)  query  time. 

Proposition  3  For  any  e  >  0,  there  is  an  algorithm  for  e-NNS  in  lRd  under  the  L  norm  for 
p  G  [1,2]  which  uses  ( nd )°^1)  preprocessing  and  requires  0(d)  query  time. 

We  obtain  these  results  by  reducing  e-NNS  to  a  new  problem,  viz.,  point  location  in  equal  balls. 
This  is  achieved  by  means  of  a  novel  data  structure  called  ring-cover  trees ,  described  in  Section  3. 
Our  technique  can  be  viewed  as  a  variant  of  parametric  search  [53],  in  that  they  allow  us  to  reduce 
an  optimization  problem  to  its  decision  version.  The  main  difference  is  that  in  our  case  in  answering 
a  query  we  can  only  ask  for  a  solution  to  a  decision  problem  belonging  to  a  prespecified  set,  since 
solving  the  decision  problem  (i.e. ,  point  location  in  equal  balls)  requires  data  structures  created 
during  preprocessing.  We  believe  this  technique  will  find  further  applications  to  problems  where 
parametric  search  has  been  helpful. 

fn  Section  4,  we  give  two  solutions  to  the  point  location  problem.  One  is  based  on  a  method 
akin  to  the  Elias  bucketing  algorithm  [64]  —  we  decompose  each  ball  into  a  bounded  number  of  cells 
and  store  them  in  a  dictionary.  This  allows  us  to  achieve  0(d)  query  time,  while  the  preprocessing 
is  exponential  in  d ,  implying  Proposition  2.  For  the  second  solution,  we  introduce  the  technique 
of  locality-sensitive  hashing.  The  key  idea  is  to  use  hash  functions  such  that  the  probability  of 
collision  is  much  higher  for  objects  that  are  close  to  each  other  than  for  those  that  are  far  apart. 
We  prove  that  existence  of  such  functions  for  any  domain  (not  necessarily  a  metric  space)  implies 
the  existence  of  fast  e-NNS  algorithms  for  that  domain,  with  preprocessing  cost  only  linear  in  d 
and  sublinear  in  n.  We  then  present  two  families  of  such  functions  -  one  for  a  Hamming  space  and 
the  other  for  a  family  of  subsets  of  a  set  under  the  resemblance  measure  used  by  Broder  et  al  [fO] 
to  cluster  web  documents.  The  algorithm  based  on  the  first  family  is  used  to  obtain  a  nearest- 
neighbor  algorithm  for  data  sets  from  ffi1.  by  embedding  the  points  from  onto  a  Hamming  cube  in 
a  distance  preserving  manner.  The  algorithm  for  the  resemblance  measure  is  shown  to  have  several 
applications  to  information  retrieval  and  pattern  recognition.  We  also  give  additional  applications 
of  locality  sensitive  hashing  to  dynamic  closest-pair  problem  and  fast  clustering  algorithms.  All 
our  algorithms  based  on  this  method  are  easy  to  implement  and  have  other  advantages  —  they 
exploit  sparsity  of  data  and  the  running  times  are  much  lower  in  practice  [37]  than  predicted  by 
theoretical  analysis.  We  expect  these  results  will  have  a  significant  practical  impact. 

An  elegant  technique  for  reducing  complexity  owing  to  dimensionality  is  to  project  the  points 
into  a  random  subspace  of  lower  dimension,  e.g.,  by  projecting  P  onto  a  small  collection  of  random 
lines  through  the  origin.  Specifically,  we  could  employ  the  result  of  Frankl  and  Maehara  [33], 
which  improves  upon  the  Johnson-Lindenstrauss  Lemma  [42],  showing  that  a  projection  of  P  onto 
a  subspace  defined  by  roughly  9e~2  In  n  random  lines  preserves  all  inter-point  distances  to  within 
a  relative  error  of  e,  with  high  probability.  Applying  this  result  to  an  algorithm  with  query  time 
0(l)rf,  we  obtain  an  algorithm  with  query  time  n9e  ~ .  Unfortunately,  this  would  lead  to  a  sublinear 
query  time  only  for  large  values  of  e.  fn  Section  A  of  the  Appendix,  we  give  a  version  of  the  random 
projection  result  using  a  much  simpler  proof  than  that  of  Frankl  and  Maehara.  We  also  consider 
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the  extensions  of  the  random  projection  approach  to  L  norms  for  p  ^  2.  Using  random  projections 
and  Proposition  2,  we  obtain  the  algorithm  described  in  Proposition  3.  Unfortunately,  the  high 
preprocessing  cost  (its  exponent  grows  with  1/e)  makes  this  algorithm  impractical  for  small  e. 


2  Preliminaries 

We  use  Ip  to  denote  the  space  $ld  under  the  lp  norm.  For  any  point  v  G  we  denote  by  ||u||p  the 
lp  norm  of  the  vector  u;  we  omit  the  subscript  when  p  =  2.  Also,  Hd  =  ({0,  l}rf,  djj)  will  denote  the 
Hamming  metric  space  of  dimension  d.  Let  A4  =  (X,  d)  be  any  metric  space,  P  C  X,  and  p  G  X. 
We  will  employ  the  following  notation:  d(p,  P)  =  min qep  d(p,  q),  and  A(P)  =  ma%.ggp  d(p,  q)  is 
the  diameter  of  P, 

Definition  1  The  ball  of  radius  r  centered  at  p  is  defined  as  B(p,r)  =  {q  €  X  \  d(p,q)  <  r}. 
The  ring  R(p,  ri,  r2)  centered  at  p  is  defined  as  R{p ,  r i,  r2)  =  B(p,  r2)  -  B(p,  ri)  =  €  X  |  ri  < 
d(p,  q)  <  r2j- 

Let  VLd(r)  denote  the  volume  of  a  ball  of  radius  r  in  ld.  The  following  fact  is  standard  [57,  page 

111- 

Fact  1  Let  L(.)  denote  the  gamma  function.  Then  Vd(r) 

2wd'2  , d 

dT{d/2)r  ' 

3  Reduction  to  Point  Location  in  Equal  Balls 

The  key  idea  is  to  reduce  the  e-NNS  to  the  following  problems  of  point  location  in  equal  balls. 

Definition  2  (Point  Location  in  Equal  Balls  (PLEB))  Given  n  radius-r  balls  centered  atC  = 
{ci, . .  . ,  cn }  in  Ai  =  (X,  d) ,  devise  a  data  structure  which  for  any  query  point  q  G  X  does  the  fol¬ 
lowing:  if  there  exists  C{^C  such  that  q  G  B[c{,r )  then  return  aj,  else  return  NO. 

Definition  3  (e-Point  Location  in  Equal  Balls  (e-PLEB))  Given  n  radius-r  balls  centered  at 
C  —  {ci, . . . ,  Cn }  in  J\A  =  (X,  d),  devise  a  data  structure  which  for  any  query  point  q  G  X  does  the 
following: 

•  if  there  exists  c*  G  C  with  q  G  B(q,  r)  then  return  YES  and  a  point  c[  such  that  q  G  B ( c( ,  (1  + 

•  if  q  B(ci ,  (1  +  e)r)  for  all  c,  G  C  then  return  NO, 

•  if  for  the  point  c*  closest  to  q  we  have  r  <  d(q ,  c,)  <  ((1  +  (.)/•/  return  either  YES  or  NO. 

Observe  that  PLEB  (e-PLEB)  can  be  reduced  to  NNS  (e-NNS),  with  the  same  preprocessing 
and  query  costs,  as  follows:  it  suffices  to  find  an  exact  (e-approximate)  nearest  neighbor  and  then 


(2r(i  +  i  m 

F(1  +  n/p) 


■rd  and  Vd(r)  = 
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compare  its  distance  from  q  with  r.  The  main  point  of  this  section  is  to  show  that  there  is  a 
reduction  in  reverse  from  e-NNS  to  e-PLEB,  with  only  a  small  overhead  in  preprocessing  and  query 
costs.  This  reduction  relies  on  a  data  structure  called  a  ring-cover  tree.  This  structure  exploits  the 
fact  that  for  any  point  set  P ,  we  can  either  find  a  ring- separator  or  a  cover.  Either  construct  allows 
us  to  decompose  P  into  smaller  sets  5y,. . . . ,  Si  such  that  for  all  i ,  |54|  <  c|P|  for  some  c  <  1,  and 
1*5*1  <  h|P|  for  b  <  1  +  1/log2  n.  This  decomposition  has  the  property  that  while  searching  P  it 
is  possible  to  quickly  restrict  the  search  to  one  of  the  sets  Sy. 

There  is  a  simpler  but  much  weaker  reduction  from  e-NN  to  e-PLEB.  Let  R  be  the  ratio  of 
the  smallest  and  the  largest  inter-point  distances  in  P.  For  each  l  €  {1  +  e)°,  (1  +  e)1, . . R}, 
generate  a  sequence  of  balls  Bl  =  { B[ ,  . . . .  Bln }  of  radius  l  centered  at  p\, . .  .,pn.  Each  sequence 
Bl  forms  an  instance  of  PLEB.  Then,  given  query  q ,  we  find  via  binary  search  the  minimal  l 
for  which  there  exists  an  i  such  that  q  6  B\  and  return  pi  as  an  approximate  nearest  neighbor. 
The  overall  reduction  parameters  are:  query  time  overhead  factor  O  (log  log  R)  and  space  overhead 
factor  O(logP).  The  simplicity  of  this  reduction  is  very  useful  in  practice.  On  the  other  hand, 
the  0(log  R)  space  overhead  is  unacceptable  when  R  is  large;  in  general,  R  may  be  unbounded.  In 
the  final  version,  we  will  show  that  by  using  a  variation  of  this  method,  storage  can  be  reduced  to 
0(n2\ogn),  which  still  does  not  give  the  desired  0(l/c)d0(n)  bound. 

Definition  4  A  ring  R(p ,  rq,  r2)  is  an  (ay,  a2,  /3)-ring  separator  for  P  if  \PC\  B{p,  ?q)|  >  ay|P| 
and  \P  \  B(p,  r2)\  >  o2|P|.  where  r2/rq  —  ft  >  1,  ay,  a2  >  0. 

Definition  5  A  set  S  C  P  is  a  (7,  <5)-cluster  for  P  if  for  every  p  6  S ,  \P  fl  B{p)  7A(,S'))|  <  h|P|, 
where  0  <  7,  S  <  1. 

Definition  6  A  sequence  of  sets  A{  C  P  is  called  a  ( b ,  c,  d)-cover  for  S  C  P,  if  there 

exists  an  r  >  dA(A)  for  A  =  U,;  A,  such  that  S  C  A  and  for  i  =  1 

•  l*PO  (UpeAtB(p,r))\  <  b\Ai\, 

•  | <  c|P|. 

where  b  >  1,  0  <  c  <  1,  d  >  0. 

Theorem  1  For  any  P,  0  <  a  <  1,  and  ft  >  1 ,  one  of  the  following  two  properties  must  hold: 

1.  P  has  an  (a,  a,  ft) -ring  separator,  or 

2.  P  contains  a  (77-  o)- cluster  of  size  at  least  (f  —  2a)|P|, 

Proof  Sketch:  First  note  that  for  a  >  1/2,  property  (1)  must  be  false  but  then  property  (2)  is 
trivially  true.  In  general,  assume  that  (1)  does  not  hold.  Then,  for  any  point  p  and  radius  r  define: 

•  /“M  =  \P~  B(p,ftr)\, 

•  fp(r )  =  |PnP(p,r)|. 
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Clearly,  /“( 0)  =  n,  ff°( oo)  =  0,  /°(0)  =  0,  and  /°( oo)  =  n.  Also,  notice  that  /“(r)  is  nionoton- 
ically  decreasing  and  f(J  (r)  is  monotonically  increasing.  It  follows  that  there  must  exist  a  choice 
of  r  (say  /•,.)  such  that  ff°(rp)  =  fp(rp).  Since  (1)  does  not  hold,  for  any  value  of  r  we  must  have 
min(/“(r), /°(r))  <  an ,  which  implies  that  /“(rp)  =  fp(rp )  <  an. 

Let  q  be  a  point  such  that  rq  is  minimal.  Define  S  =  P  C\  R(q,  rg,  /3rq)]  it  follows  that  |S'|  > 
(1  —  2a)n.  Also,  notice  that  for  any  s ,  s'  G  S',  d(s ,  s')  <  2/3rg ,  implying  that  A(S')  <  2/3rq.  Finally, 
for  any  s  G  S',  |Pfl  rg)|  <  |F  fli?(s,  rs)|  <  an.  ■ 

Theorem  2  Let  S  be  a  (7,  5) -cluster  for  P.  Then  for  any  b ,  there  is  an  algorithm  which  produces 
a  sequence  of  sets  A\ ,  . . .  ,  Ak  C  P  constituting  a  (6,  5,  ^1+7\7log  — )- cover  for  S. 

Proof  Sketch: 

The  algorithm  below  greedily  computes  a  good  cover  for  S. 


Algorithm  Cover:  S  =  P  C\  R(q,  rg,  flrq)] 


iMS) . 


logb  n 


i  3  t-  0; 


repeat 


j  <-  j  +  1;  choose  some  pj  G  S ;  Bl-  4—  {pj}] 
1  ^ —  1 5 


while  \Pr\UgeB,B(q,r)  \  >  b\Bl-\  do 
B]+l  GPnU!££,%,r); 
i  <r~  i  +  1 

endwhile; 


Aj  <—  B)]  S  <-  5  -  A,;  P  ^  P  -  Ag 
until  S'  =  cf>\ 


k  j. 


In  order  to  prove  the  correctness  of  the  algorithm,  it  suffices  to  make  the  following  four  claims. 

•  S  C  A  =  UjAj  —  Follows  from  the  termination  condition  of  the  outer  loop. 

•  for  all  j  G  {1, . . . ,  k}  and  any  p  G  S',  iPflUqg^  B{p,  r)  \  <  b\Aj  \  —  Follows  from  the  termination 
condition  of  the  inner  loop. 

•  for  all  j  G  {1, . . . ,  k},  |A;|  <  5\P\  —  Clearly,  for  any  j ,  the  inner  loop  is  repeated  at  most 
log bn  times.  Hence,  ma xgeAj  d(pj ,  q)  <  rlogbn  <  7A(S).  As  S'  is  a  (7,  d)-cluster,  we  have 
that  \B(pj,  7A(S))  C  P\  <  ^|P|.  Hence,  |Aj|  <  S\P\. 

•  r  <  (i7a(1o lbn  —  Since  A(A)  ^  A(5)  +  r  loS&  n  =  A(5)  +  7A(5')  =  (1  +  t)a(5'). 


Corollary  1  For  any  P,  0  <  a  <  1,  j3  >  1,  b  >  1,  one  of  the  following  properties  must  hold: 

1.  P  has  an  (a,  a ,  f3)-ring  separator  R[p ,  r,  fir) ,  or 

2.  There  is  a  (6,  tv,  d)-cover  for  some  S'  C  P  such  that  |S|  >  (1  —  2q')tj  and  d  =  (2.'?+i^iog  n 
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3.1  Constructing  Ring-Cover  Trees 


The  construction  of  a  ring-cover  tree  is  recursive.  For  any  given  P  at  the  root,  we  use  properties  (1) 
and  (2)  in  Corollary  1  to  decompose  P  into  some  smaller  sets  S\. . . Si ;  these  sets  are  assigned  to 
the  children  of  the  node  for  P.  Note  the  base  case  case  is  when  P  is  sufficiently  small  and  we  omit 
that  in  this  abstract.  We  also  store  some  additional  information  at  the  node  for  P  which  enables  us 
to  restrict  the  nearest  neighbor  search  to  one  of  the  children  of  P,  by  using  distance  computations  or 
point  location  queries.  For  simplicity,  assume  that  we  can  invoke  an  exact  PLEB  (not  e-PLEB);  the 
construction  can  be  easily  modified  for  approximate  point  location.  There  are  two  cases  depending 
on  which  of  the  two  properties  (1)  and  (2)  holds.  Let  ff  =  2(1  +  7),  b=  1+  ^  n ,  and  a  =  1~1/'2Iogw . 

Case  1.  In  this  case,  we  will  call  P  a  ring  node.  We  define  its  children  to  be  S 1  =  Pfl5(j),  fir) 
and  —  P  B{p ,  r).  Also,  we  store  the  information  about  the  ring  separator  R  at  the  node 

P. 

Case  2.  Here,  we  call  P  a  cover  node.  We  define  Si  =  PH  U  p^A,B(p,  r)  and  So  =  S  —  A.  The 
information  stored  at  P  is  as  follows.  Let  r 0  =  (1  +  l/e)A(A)  and  let  r,-  =  ro/(l  +  <:)'  for 
i  €  {1, . . . ,  k},  where  k  =  log1+e  (1+1/d1°gfcw  +  1.  Notice  that  rk  =  iog7jt|f+q  =  For  each 
rj,  generate  an  instance  of  PLEB  with  balls  B[p)  rf)  for  p  G  A;  all  instances  are  stored  at  P. 

Theorem  3  The  ring-cover  tree  can  he  constructed  in  deterministic  0(n2)  time. 

Proof  Sketch:  The  construction  proceeds  as  follows.  On  each  level,  we  determine  if  a  node  is 
a  ring  or  cover  node;  then  we  compute  the  ring  or  the  cover.  As  the  number  of  levels  is  0(1),  it 
is  sufficient  to  consider  only  the  first  (root)  node  P,  which  we  construct  as  follows.  Firstly,  for 
each  j)  €  P  we  construct  a  list  Lp  containing  all  other  points  q  6  P  sorted  in  increasing  order  of 
d(p,q).  This  takes  0(n2\ogn)  time.  Then,  we  apply  the  method  from  the  proof  of  Theorem  1. 
More  specifically,  we  compute  the  functions  Jff'  and  fp  for  each  p  G  P.  Having  the  lists  Lp  it  can 
be  easily  implemented  in  0(n 2)  time.  Then  we  try  to  find  a  ring;  if  one  is  found  we  are  done.  In 
the  opposite  case  we  find  a  cluster  and  then  apply  the  algorithm  COVER  to  find  a  cover.  It  can  be 
verified  that  this  algorithm  runs  in  time  0(n)  time  assuming  the  lists  Lp  are  given.  The  PLEB 
generation  adds  only  another  0(1)  factor.  Thus  the  total  required  time  is  0(n2).  ■ 

We  now  describe  how  to  efficiently  search  a  ring-cover  tree.  It  suffices  to  show  that  for  any  node 
P  we  can  restrict  the  search  to  one  of  its  children  using  a  small  number  of  tests.  Let  min q(p,p') 
denote  the  point  out  of  p  and  p'  that  is  closer  to  q.  The  search  procedure  is  as  follows;  we  omit  the 
obvious  base  case. 

Procedure  Search: 

1.  if  P  is  a  ring  node  with  an  (tv,  a ,  /3)-ring  separator  R{p)  r,  fir)  then: 

(a)  if  q  6  B(p ,  r(l  +  1/e))  then  return  Search(g,  5'i); 

(b)  else  compute  p'  =  Search(§,  $2)]  return  min q(p,p'). 

2.  if  P  is  a  cover  node  with  a  ( b ,  c,  d)-cover  A\, . . . ,  A[  of  radius  r  for  S  C  P  then: 


(a)  if  q  ((  B(a,ro)  then  compute  p  =  Search(g,  P  —  A),  choose  any  a  €  .4,  and  return 
min  q(p,a)- 

(b)  else  if  q  G  B{a ,  ro)  for  some  a  G  A  but  q  £  B{a' ,  rjf)  for  all  a'  G  A  then  using  binary 
search  on  r8s,  find  an  e-NN  p  of  q  in  A)  compute  p'  =  Search(g,  P  —  A ),  and  return 
min  q(p,p')] 

(c)  else  if  q  G  B(a,  rp-)  for  some  a  G  A{  then  return  Search(g,  Si). 

3.2  Analysis  of  Ring- Cover  Trees 

We  begin  the  analysis  of  the  ring-cover  tree  construction  by  establishing  the  validity  of  the  search 
procedure. 

Lemma  1  Procedure  Searchfq ,  P)  produces  an  c-nearest  neighbor  for  q  in  P. 

Proof  Sketch:  Consider  the  two  cases: 

1.  P  is  a  ring  node. 

(a)  Consider  any  s  G  P  —  S\.  Then  d(s,p)  <  d(.s ,  q)+d(q,p ),  implying  that  d(s ,  q)  >  d(s,p)  — 
d(q,p).  Since  .s  ^  5| .  we  know  that  d(s,p)  >  fir  =  2(l  +  l/e)r,  while  d(p ,  q)  <  r(l  +  l/e). 
Then,  d(s,  q)  >  (1  +  l/e)r  >  d(q,p). 

(b)  For  any  .s  G  B[p)  r),  d(q,p)  <  d(q,  s )  +  d(s,p ),  implying  that  d(q ,  s)  >  d(q,p)  —  d(s,p)  > 

d{q,p)  -  r.  It  follows  that  gj}  <  =  1  +  <  1  +  e. 

2.  P  is  a  cover  node. 

(a)  Similar  to  Case  1(b), 

(b)  Obvious. 

(c)  For  any  p  G  P  —  Si,  d(p ,  a )  >  r.  Since  q  G  B(a,  r*),  we  have  d(q.  a)  <  r*  =  <  d^p_^ . 

■ 

The  proofs  of  Lemmas  2  and  3  are  omitted. 

Lemma  2  The  depth  of  a  ring-cover  tree  is  0{\oglj2a  n)  =  0(log2  n). 

Lemma  3  Procedure  Search  requires  0(log2  n  X  log  k)  distance  computations  or  PLEB  queries. 

Lemma  4  A  ring-cover  tree  requires  space  at  most  0(knbl°Sl/2a  n  (1+2(1-2q))1oSi1)  =  O(npolylogn) 
not  counting  the  additional  non-data  storage  used  by  algorithms  implementing  PLEBs. 

Proof  Sketch:  Let  S(n)  be  an  upper  bound  on  the  space  requirement  for  a  ring-cover  tree  for 
point-set  P  of  size  n.  Then  for  a  cover  node: 

S(n)  <  max  max 

l  A\...Ai,  Ai  disjoint,  |^4*|<cm,  \A\>(l  —  2a)n 

[j2S(b\At\)]  +  S(n-\A\)  +  \A\bk 
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For  a  ring  node: 


S(n)  <  2s(|(l  +  2(l-2a))^  +  l 

The  bound  follows  by  solving  this  recurrence. 


Corollary  2  Given  an  algorithm  for  PLEB  which  uses  f(n )  space  on  an  instance  of  size  n  where 
f(n)  is  convex ,  a  ring-cover  tree  for  an  n-point  set  P  requires  total  space  0(/(npolylog  n)) . 


Fact  2  For  any  PLEB  instance  (C.  r)  generated  by  a  ring-cover  tree, 


A  (C) 


=  O 


l  +  € 

7 


log6  n  . 


4  Point  Location  in  Equal  Balls 

We  present  two  techniques  for  solving  the  e-PLEB  problem.  The  first  is  based  on  a  method  similar 
to  the  Elias  bucketing  algorithm  [64]  and  works  for  any  lp  norm,  establishing  Proposition  2.  The 
second  uses  locality-sensitive  hashing  and  applies  directly  only  to  Hamming  spaces  (this  bears  some 
similarity  to  the  indexing  technique  introduced  by  Greene,  Parnas,  and  Yao  [38]  and  the  algorithm 
for  all-pairs  vector  intersection  of  Karp,  Waarts,  and  Zweig  [48],  although  the  technical  development 
is  very  different).  However,  by  exploiting  Facts  2  and  6  (Appendix  A),  the  instances  of  e-PLEB 
generated  while  solving  e-NN  for  if  can  be  reduced  to  e-PLEB  in  Hm,  where  m  =  d\ogbn  X 
max(l/e,  e).  Also,  by  Fact  5  (Appendix  A),  we  can  reduce  if  to  for  any  p  £  [1,2].  Hence, 
locality-sensitive  hashing  can  be  used  for  any  lp  norm  where  p  £  [1,2],  establishing  Proposition  1.  It 
can  also  be  used  for  the  set  resemblance  measure  used  by  Broder  et  al  [10]  to  cluster  web  documents. 
We  assume,  without  loss  of  generality,  that  all  balls  are  of  radius  1. 

4.1  The  Bucketing  Method 

Assume  for  now  that  p  =  2.  Impose  a  uniform  grid  of  spacing  s  =  tfs/d  on  $ld.  Clearly,  the 
distance  between  any  two  points  belonging  to  one  grid  cuboid  is  at  most  e.  By  Fact  2,  each  side 
of  the  smallest  cuboid  containing  balls  from  C  is  of  length  at  most  0(\/dlogb  n  max(l/e,  e))  times 
the  side-length  of  a  grid  cell.  For  each  ball  Bi ,  define  Bi  to  be  the  set  of  grid  cells  intersecting 
Bi.  Store  all  elements  from  U \{B{  in  a  hash  table  [34,  55],  together  with  the  information  about 
the  corresponding  ball(s).  (We  can  use  hashing  since  by  the  preceding  discussion  the  universe  is 
of  bounded  size.)  After  preprocessing,  to  answer  a  query  q  it  suffices  to  compute  the  cell  which 
contains  q  and  check  if  it  is  stored  in  the  table. 

We  claim  that  for  0  <  e  <  1,  |B|  =  0(\/e)d.  To  see  this,  observe  that  \B\  is  bounded  by  the 
volume  of  a  d-dimensional  ball  of  radius  r  =  2 f e\fd,  which  by  Fact  1  is  2 °(d)rd /dd!2  <  0{l/e)d. 
Hence,  the  total  space  required  is  O(n)  X  0(l/e)d .  The  query  time  is  the  time  to  compute  the  hash 
function.  We  use  hash  functions  of  the  form: 

h((xi, . . . ,  xj))  =  ( a\Xi  +  .  . .  +  a^Xd  mod  P)  mod  M 

where  P  is  a  prime,  M  is  the  hash  table  size,  and  (i\. . . . .  a,j  G  Zf.  This  family  gives  a  static 
dictionary  with  0(1)  access  time  [34].  The  hash  functions  can  be  evaluated  using  0(d)  arithmetic 
operations.  For  general  lp  norms,  we  modify  s  to  e/dl!p .  The  bound  on  \B\  applies  unchanged. 
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Theorem  4  For  0  <  e  <  1,  there  is  an  algorithm  for  e-PLEB  in  l^  using  0(n )  X  0(1/6)^  prepro¬ 
cessing  and  0(1)  evaluations  of  a  hash  function  for  each  query. 

4.2  Locality- Sensitive  Hashing 

We  introduce  the  notion  of  locality- sensitive  hashing  and  apply  it  to  sublinear-time  similarity  search¬ 
ing.  The  definition  makes  no  assumptions  about  the  object  similarity  measure.  In  fact,  it  is  ap¬ 
plicable  to  both  similarity  and  dissimilarity  measures;  an  example  of  the  former  is  dot  product, 
while  any  distance  metric  is  an  instance  of  the  latter.  To  unify  notation,  we  define  a  ball  for  a 
similarity  measure  D  as  B{q,r)  =  {p  :  D{q,p)  >  r}.  We  also  generalize  the  notion  of  e-PLEB  to 
(rii  r2)-PLEB  where  for  any  query  point  q  we  require  the  answer  to  be  YES  if  P  n  B(q,  r2)  ^  0  and 
NO  otherwise. 

Definition  7  A  family  B  =  {h :  S  — »  17}  is  called  (iq,  r2,  pi,  p2)-sensitive  for  D  if  for  any  q,p  G  S 

•  'if  P  €  B(q ,  ri)  then  Pr n[h{q)  =  h(p)\  >  p1; 

•  if  p  i  B(q ,  r2)  then  Pr n[h{q)  =  h(p)\  <  p2. 

In  order  for  a  locality-sensitive  family  to  be  useful,  it  has  to  satisfy  inequalities  pi  >  p2  and  ?q  <  r2 

when  D  is  a  dissimilarity  measure,  or  p\  >  p2  and  r\  >  r2  when  D  is  a  similarity  measure. 

For  k  specified  later,  define  a  function  family  Q  —  {g  :  S  — >  Uk }  such  that  g(p)  —  (hi(p), ,  hj. (p) ) , 
where  hi  €  B.  The  algorithm  is  as  follows.  For  an  integer  /  we  choose  l  functions  g\  from 

Q  independently  and  uniformly  at  random.  During  preprocessing,  we  store  each  p  G  P  in  the 
bucket  gj  (p) ,  for  j  =  1 ,...,!.  Since  the  total  number  of  buckets  may  be  large,  we  retain  only  the 
non-empty  buckets  by  resorting  to  hashing  [34,  55].  To  process  a  query  q ,  we  search  all  buckets 
gi(q),  ■  ■  -  jgiiq)',  as  it  is  possible  (though  unlikely)  that  the  total  number  of  points  stored  in  those 
bucket  is  large,  we  interrupt  search  after  finding  first  21  points  (including  duplicates).  Let  pi, ...  ,pt 
be  the  points  encountered  therein.  For  each  pj,  if  pj  G  B[q)  r2)  then  we  return  YES  and  pj ,  else  we 
return  NO. 

The  parameters  k  and  l  are  chosen  so  as  to  ensure  that  with  a  constant  probability  the  following 
two  properties  hold: 

1.  if  there  exists  p*  G  B(q ,  r\)  then  gj{p*)  =  9j{<l)  for  some  j  =  1 . . and 

2.  the  total  number  of  collisions  of  q  with  points  from  P  —  B(q,  r2)  is  less  than  21,  i.e. 

l 

Y,  5(9T2)>  nflo"1(flj(g))|  <  21. 

1=1 

Observe  that  if  (1)  and  (2)  hold,  then  the  algorithm  is  correct. 

Theorem  5  Suppose  there  is  a  (ri,  r2,  pi,  p2)-sensitive  family  B  for  D.  Then  there  exists  an 
algorithm  for  (ri,  r2)-PLEB  under  measure  D  which  uses  0(dn-\-n1+p)  space  and  0{np )  evaluations 
of  the  hash  function  for  each  query,  where  p  =  . 
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Proof  Sketch:  It  suffices  to  ensure  that  (1)  holds  with  probability  Pi  and  (2)  holds  with  probability 
P2  such  that  both  Pi  and  P2  are  strictly  greater  than  half.  Assume  that  p*  G  B(q,r  1);  the 
proof  is  similar  when  p*  B(q,r2).  Set  k  =  logi /P2  n,  then  the  probability  that  g(p')  —  g(q)  for 
p  G  P  —  B(q,r2)  is  at  most  p\  =  k.  Thus  the  expected  number  of  elements  from  P  -  B(q,r2) 
colliding  with  q  under  fixed  gj  is  at  most  1;  the  expected  number  of  such  collisions  with  any  gj  is 
at  most  l.  Thus  by  Markov  inequality  the  probability  that  this  number  exceeds  21  is  less  than  1/2; 
therefore  the  probability  that  the  property  (2)  holds  is  P2  >  1/2. 

Consider  now  the  probability  of  gj(p*)  =  9j(q)-  Clearly,  it  is  bounded  from  below  by 

pk  =  p1i°Si/r2  n  =  „-} =  „-p. 

Thus  the  probability  that  such  a  gj  exists  is  at  least  Pi  =  1  —  (1  —  n~p)1 .  By  setting  l  =  np  we  get 
Pi  >  1  —  1/e  >  1/2.  The  theorem  follows.  ■ 

We  apply  Theorem  5  to  two  measures:  the  Hamming  metric  and  set  resemblance  [10];  the  latter 
is  a  similarity  measure  defined  for  any  pair  of  sets  A  and  B  as  D(A,  B )  =  |  4lj]b I '  F°r  the 
measure,  we  apply  a  family  of  projections  for  fast  hashing  with  AC0  operations  [6].  For  the  second 
measure,  we  use  sketch  functions  used  earlier  [10]  for  estimation  of  the  resemblance  between  given 
sets  A  and  B. 

Proposition  4  ([6])  Let  S  =  Hd  and  D(p ,  q)  be  the  Hamming  metric  for  p,q  G  H.  Then  for  any 
r,  e  >  0,  the  family  7i  =  {hi  :  hi((bi , . .  .bd))  =  6*,  i  =  1  .  .  .n}  is  (r,  r(l  -f  <.),  I  1  - 

sensitive. 

Corollary  3  For  any  e  >  0,  there  exists  an  algorithm  for  e-PLEB  in  Hd  (or,  ld  for  any  p  6  [1,  2]) 
using  0(dn  +  n1+1/(1+e))  space  and  0(n1^1+fdl)  hash  function  evaluations  for  each  query.  The  hash 
function  can  be  evaluated  using  0(d)  operations. 

Proof  Sketch:  We  use  Proposition  4  and  Theorem  5.  First,  we  need  to  estimate  the  value  of 
p  =  ,  where  pi  =  1  —  j  and  p2  =  1  —  .  Without  loss  of  generality,  we  assume  that 

r  <  since  we  can  increase  dimensionality  by  adding  a  sufficiently  long  string  of  Os  at  the  end 
of  each  point.  Observe  that 

_  In  1/pi  ln  1=773  _  ln(l  -  r/d ) 

P  In  1/ p2  <  In  1_(1^e)r/d  ln(l  -  (1  +  e)r/d) 

Multiplying  both  the  numerator  and  the  denominator  by  j  we  obtain  that: 

£ln(l  ~r/d)  ln(l  -r/d)d!r  U 

P  ~  JhAy^fTTe^Jdf  ~  ln(l  -  (1  +  e)r/d)d/r  ~  T 

In  order  to  upper  bound  p,  we  need  to  bound  U  from  below  and  L  from  above;  note  that  both  U 
and  L  are  negative.  To  this  end  we  use  the  following  inequalities  [55]: 

(1  -  (1  +  e)r/d)d^r  <  e~{1+^  and  (1  -  r / d)d!r  >  e~l  (l  -  -^j-). 
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Therefore 


U 

T 


< 


ln(e  M1  -  577)) 

In  e“(1+e) 

-1  +  ln(l  - 


M1  -  577  ) 


-(i  +  <0 

=  1/(1  +  e)“  i  +  e 

<  1/(1  +  e)  -  ln(l  -  1/ In  n) 


where  the  last  step  uses  the  assumptions  that  e  >  0  and  r  <  We  conclude  that 


ip  <  =  „i/(i+0(i  _  1/ln  n)~lnn  =  0(nll^). 


The  hash  function  evaluation  can  be  made  faster  than  0(d)  for  sparse  data,  i.e.,  when  the 
number  of  non-zero  coordinates  of  a  query  point  is  small.  It  suffices  to  sample  the  bits  from  the 
non-zero  entries  of  the  vectors;  a  similar  method  works  for  the  functions  used  to  build  a  static 
dictionary.  Moreover,  our  experience  is  that  the  preprocessing  space  and  query  time  are  much 
lower  than  the  above  bound  indicates.  In  particular,  we  have  implemented  a  variant  of  the  above 
data  structure  for  the  case  when  data  is  stored  on  disk  [37].  For  a  data  set  of  20,000  d-color 
histograms  for  images  (with  d  ranging  up  to  64)  only  3-9  disk  accesses  were  required  in  order  to 
achieve  small  average  error. 

Proposition  5  ([10])  Let  S  be  the  set  of  all  subsets  of  X  =  {1 . .  .a}  and  let  D  be  the  set  resem¬ 
blance  measure.  Then ,  for  1  >  rq  >  r-i  >  0,  the  following  hash  family  is  (ri,  r?,  W,  rf) -sensitive: 

LL  =  {h7 r  :  h7T(A)  =  max7r(a),  7 r  is  a  permutation  of  X}. 

a  £-4 

Corollary  4  For  0  <  e,  r  <  1,  there  exists  an  algorithm  for  (r,  er)-PLEB  under  set  resemblance 
measure  D  using  0(dn  +  n1+p)  space  and  0(np)  evaluations  of  the  hash  function  for  each  query , 
where  p  =  ■p111 . 

We  now  discuss  further  applications  of  the  above  corollary.  For  any  pair  of  points  p,  q  G  LLd , 
consider  the  similarity  measure  D(p,  q)  defined  as  the  dot  product  p-q.  The  dot  product  is  a  common 
measure  used  in  information  retrieval  applications  [32];  it  is  also  of  use  in  molecular  clustering  [14]. 
By  using  techniques  by  Indyk,  Motwani,  and  Venkatasubramanian  [41]  it  can  also  be  used  for 
solving  the  approximate  largest  common  point  set  problem,  which  has  many  applications  in  image 
retrieval  and  pattern  recognition.  By  a  simple  substitution  of  parameters,  we  can  prove  that  for 
a  set  of  binary  vectors  of  approximately  the  same  weight,  PLEB  under  dot  product  measure  (for 
queries  of  a  fixed  weight)  can  be  reduced  to  PLEB  under  set  resemblance  measure.  The  fixed  weight 
assumption  can  be  easily  satisfied  by  splitting  the  data  points  into  0(log  d)  groups  of  approximately 
the  same  weight,  and  then  making  the  same  partition  for  weights  of  potential  queries. 
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4.3  Further  Applications  of  PLEB  Algorithms 


The  PLEB  procedures  described  above  can  also  be  used  in  cases  where  points  are  being  inserted 
and  deleted  over  time.  In  the  randomized  indexing  method,  insertion  can  be  performed  by  adding 
the  point  to  all  indices,  and  deletion  can  be  performed  by  deleting  the  point  from  all  indices.  In  the 
bucketing  method,  insertion  and  deletion  can  be  performed  by  adding  or  deleting  all  elements  of 
B  in  the  hash  table.  However,  in  order  to  apply  these  methods,  we  have  to  assume  that  the  points 
have  integer  coordinates  with  absolute  value  bounded  by,  say,  M.  Let  n  be  the  maximum  number 
of  points  present  at  any  time. 

Corollary  5  There  is  a  data  structure  for  e-PLEB  in  {1 . .  ,M}d  which  performs  insertions,  dele¬ 
tions,  and  queries  in  time  O  (l/e)d  poly  (log  M,  log  n)  using  storage  0(l/e)dn. 

Corollary  6  There  is  a  data  structure  for  e-PLEB  in  {1 . . .  M ] d  which  performs  insertions,  dele¬ 
tions,  and  queries  in  time  0(dn1^1+e^)  using  storage  0(dn  +  n1+1/(1+d). 

The  latter  corollary  follows  from  the  fact  that  in  order  to  compute  g(q)  we  do  not  need  to  keep 
the  unary  representation  of  p  explicitly.  Rather  than  that,  it  is  sufficient  for  each  coordinate  to 
keep  track  of  the  breakpoint  at  which  the  sampled  bits  are  changing  values  from  0  to  1;  this  clearly 
requires  only  constant  memory  words  per  coordinate. 

By  keeping  several  copies  of  PLEB  as  in  the  simple  method  described  at  the  beginning  of 
Section  3,  we  can  answer  approximate  closest-pair  queries.  It  is  sufficient  to  check  for  every  radius 
whether  any  cell  (in  the  bucketing  method)  or  any  bucket  (in  the  randomized  indexing  method) 
contains  two  different  points;  the  smallest  radius  having  this  property  gives  an  approximation  to 
the  closest-pair  distance.  The  time  bounds  for  all  operations  are  as  in  the  above  corollaries,  but 
multiplied  by  a  factor  0(loglog1+e  M).  It  is  also  easy  to  see  that  the  bichromatic  pair  problem  (in 
which  the  points  are  colored  and  we  consider  only  pairs  of  different  colors)  or  even  multichromatic 
pair  problem  (for  more  than  two  colors)  can  be  solved  withing  the  same  time  bounds. 

Combining  both  techniques,  we  obtain  a  method  for  dynamic  estimation  of  closest  pair.  Epp- 
stein  [28]  showed  recently  that  dynamic  closest-pair  problem  has  many  application  to  hierarchical 
agglomerative  clustering,  greedy  matching  and  other  problems,  and  provided  a  data  structure  mak¬ 
ing  0(n)  distance  computations  per  update  operation.  Our  scheme  gives  an  approximate  answer 
in  sublinear  time.  Moreover  by  an  easy  simulation  of  Kruskal’s  MST  algorithm  using  dynamic 
multichromatic  closest  pair  data  structure,  the  Approximate  Minimum  Spanning  Tree  problem  can 
be  solved  in  time  bounded  by  the  cost  of  approximate  bichromatic  closest  pair  times  0(n).  Thus 
we  obtain  the  first  algorithm  solving  this  problem  in  subquadratic  time  for  any  d. 
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A  The  Dimension  Reduction  Technique 

We  first  outline  our  proof  for  the  random  projections  technique  for  dimension  reduction.  Combining 
this  with  Proposition  2,  we  obtain  the  result  given  in  Proposition  3. 

Definition  8  Let  AA  =  ( X ,  d)  and  Ai'  =  {X' ,  d')  be  two  metric  spaces.  The  space  A4  is  said,  to  have 
a  c-isometric  embedding,  or  simply  a  c-embedding,  in  M'  if  there  exists  a  map  f  :  M  — >  M' 
such  that 

(1  -  e)d(p,  q )  <  d'(f(p),f(q))  <  (1  +  e)d(p,  q) 

for  all.  p,q  6  X.  We  call,  c  the  distortion  of  the  embedding;  if  c  =  1,  we  call,  the  embedding 

isometric. 
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Frank  I  and  Maehara  [33]  gave  the  following  improvement  to  the  Johnson-Lindenstrauss  Lemma  [42] 
on  (1  +  e)-embedding  of  any  S  C  ld  in 

Lemma  5  (Frankl-Maehara  [33])  For  any  0  <  e  <  any  (sufficiently  large)  set  S  of  points  in 
$ld,  and  k  =  [9(e2  —  2e3/3)_1  In  |.S'|]  +  1,  there  exists  a  map  f  :  S  — >  3?*  such  that  for  all  u,  v  G  S, 

(1  -  e)  1 1 u  -  n||2  <  ||  f(u)  -  f(v)  ||2  <  (1  +  c)\\u  -  n||2. 


The  proof  proceeds  by  showing  that  the  square  of  the  length  of  a  projection  of  any  unit  vector 
v  on  a  random  ^-dimensional  hyperplane  is  sharply  concentrated  around  j.  Below  we  prove  an 
analogous  fact.  However,  thanks  to  the  use  of  a  different  distribution,  we  are  able  to  give  a  much 
simpler  proof  and  also  improve  the  constants.  Note  that  the  constants  are  important  as  they  appear 
in  the  exponent  of  the  time  bounds  of  the  resulting  algorithm  described  in  Proposition  3. 

Lemma  6  Let  u  be  a  unit  vector  in  lR.d .  For  any  even  positive  integer  k,  let  U\) . . .  ,Uk  be  random 
vectors  chosen  independently  from  the  d-dimensional  Gaussian  distribution 4  Nd( 0,1).  For  Xi  = 
u  ■  Ui,  define  W  =  W(u)  —  (Xi, . . . ,  X &)  and  L  —  L(u )  =  ||1F||2.  Then ,  for  any  (3  >  1, 

1.  E(L)  =  k. 

2.  Pr[L  >  (3k\  <  0(k)  X  exp(-|(/3  -  (1  +  In  /?))), 

3.  Pr [L  <  k/ffi  <  O(k)  x  exp(— -  (1  -  ln/3))). 


Proof  Sketch:  By  the  spherical  symmetry  of  Nd( 0, 1)  each  Xt  is  distributed  as  iV(0, 1)  [30,  page 
77].  Define  Yi  =  Xfi_1  +  for  i  =  1, . .  ,,k/2.  Then,  Y{  follows  the  Exponential  distribution 

with  parameter  A  =  \  (see  [30,  page  47]).  Thus  E(L)  —  JfiL \  E(Yi)  =  (A/2)  X  2  =  A;  also  one  can 
see  that  L  follows  the  Gamma  distribution  with  parameters  a  —  )  and  v  =  k/2  (see  [30,  page  46]). 
Since  this  distribution  is  a  dual  of  the  Poisson  distribution,  we  obtain  that 

Pr[Z  >  (3k]  =  Prfi^f  <  v  -  1], 

where  Pf1  is  a  random  variable  following  the  Poisson  distribution  with  parameter  at.  Clearly 


v—1 


-at  (atY 


Pr[if  <v-l]  =  '£e 

i=0  *■ 

and  therefore 

Pr[L  >  (3k]  =  =  v(^/3e)^  =  ve^~^P)  . 


8=0 


l\ 
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which  implies  the  desired  result  since  v  =  k/2. 

Finally,  we  claim  the  following  bound,  for  some  large  constant  7>  1. 

Yev/P 


Pr [L  <  k/(3\  =  ]T 


■via 


(v/py 


<  e 


■■-■via 


?(f)  = 


E  (§)'+  t  (ST 

i=1ev/(3+ 1 


ifij 


4 Each  component  is  chosen  independently  from  the  standard  normal  distribution  N( 0,  1). 
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The  second  sum  is  very  small  for  7  >  1  and  we  bound  only  the  first  one.  As  the  sequence  (|| )*  is 
decreasing  for  i  >  v//3,  we  can  bound  the  first  sum  by 

e  '/«  (l)'  =  Ll(r),  '<*  '  " 

Since  v  =  &/2,  we  obtain  the  desired  result.  ■ 

An  interesting  question  is  if  the  Johnson-Lindenstrauss  Lemma  holds  for  other  lp  norms.  A 
partial  answer  is  provided  by  the  following  two  results. 

Theorem  6  For  any  p  £  [1,  2],  any  n-point  set  S  C  lp,  and  any  e  >  0,  there  exist  a  map  f  :  S  — >  l\ 
with  k  =  0(log  n)  such  that  for  all  u,v  €  S. 

(1  -  e) 1 1 u  -  r||.  <  ||/(«)  -  f{v)\\2  <  (1  +  e) 1 1 u  -  r||;.. 

Theorem  7  The  Johnson-Lindenstrauss  Lemma  does  not  hold,  for  llZCl.  More  specifically,  there  is 
a  set  S  of  n  points  in  $td  for  some  d  such  that  any  embedding  of  S  in  has  distortion 

Proof  Sketch:  We  give  a  sketch  of  the  proof  of  Theorem  7  based  on  the  following  two  known 
facts. 

Fact  3  (Linial,  London,  and  Rabinovich  [49])  Every  n-point  metric  M  can  be  isometric-ally 
embedded  in  If,. 

Fact  4  (Linial,  London,  and  Rabinovich  [49])  There  are  graphs  with  n  vertices  which  for  any 
d  cannot  be  embedded  in  if  with  distortion  o(logri). 

Assume  for  contradiction  that  the  Johnson-Lindenstrauss  Lemma  holds  for  l ^  with  distortion 
t,  =  o(logn/\ff).  Then  for  any  graph  G  with  n  vertices:  we  embed  G  in  using  Fact  3;  by 
the  assumption,  we  reduce  the  dimension  to  /  with  distortion  t;  finally,  we  observe  that  as  the 
norms  /£,  and  l(  differ  by  at  most  a  factor  of  \fj,  we  have  an  embedding  of  G  in  I2  with  distortion 
t\/J  =  o(logn),  which  contradicts  Fact  4.  ■ 

Proof  Sketch:  We  give  a  sketch  of  the  proof  of  Theorem  6  based  on  the  following  two  known 
facts. 

Fact  5  (Johnson-Schechtman  [43])  For  any  1  <  p  <  2  and  c  >  0,  there  exists  a  constant  (3  >  1 
such  that  for  all  d>  1,  the  space  ld  has  a  (1  +  c)-embedding  in  if1'. 

Fact  6  (Linial,  London,  and  Rabinovich  [49])  For  any  e  >  0  and  every  n-point  metric  space 
M  =  ( X ,  d)  induced  by  a  set  of  n  points  in  if,  there  exists  m  such  that  M  has  a  (1  +  e)-embedding 
in  //"' .  Lf  all  points  have  coordinates  from  the  set  {1  .  . .  11} .  then  .11  can  be  embedded  isometric-ally 
in  Hm  for  m  —  Rd. 

The  function  /  is  constructed  implicitly  by  a  sequence  of  reductions:  Find  an  (1  +  €i)-isometric 
embedding  of  lp  into  1 1  (using  Fact  5)  and  let  ,S'i  be  the  image  of  S  under  this  mapping.  Find  an 
(1  +  62)-isometric  embedding  of  S'i  into  Hm  (using  Fact  6)  and  let  S2  be  the  image  of  ,S'i  under 
this  mapping.  Notice  that  for  any  s,  s'  6  S 2,  dn(s,  s')  =  ||s  —  s'|| hence  we  may  assume  that  S% 
resides  in  If' .  Finally,  find  an  (1  +  C3)-isometric  embedding  of  S2  into  l\  (using  Lemma  5).  It  is 
now  possible  to  choose  suitable  values  for  ei,  £2,  and  £3  to  obtain  the  desired  result.  ■ 
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